1 Data description

In this tutorial, we aim to obtain differentially expressed genes between two biological conditions in an integrated data. This integrated data set is generated by combining 10 microarray studies on control and TGFb-treated samples (for more information, see here http://mcr.aacrjournals.org/content/15/5/619). 

First, we assess the presence of unwanted variation in the integrated data which combines differents studies and platforms. Second, we show how to use two RUV methods: RUVinv and RUV4 to remove unwanted variation and detect differentially expressed genes comparing control samples to TGFb-treated samples. Third, we assess whether RUV methods are useful and compare the results obtained by each method.

library(ruv)
Error in library(ruv) : there is no package called ‘ruv’

The integrated data introduced above have been split into two data sets: samples A and samples B, each containing different studies, platforms and tissues. We will explore and normalise these two data sets separately in order to compare the results obtained by the two normalisation methods (RUV4 and RUVinv).

Read in the two integrated datasets.

rr

samplesA<- read.table(\expr_10data_sampleA.txt\, header = T, sep = \\t\)
samplesB<- read.table(\expr_10data_sampleB.txt\, header = T, sep = \\t\)

Look at the data in each samples A and B, then make a matrix for each these data where the row names are gene IDs.

rr

head(samplesA,3)
head(samplesB,3)

r

mA<-samplesA[,2:dim(samplesA)[2]]
mB<-samplesB[,2:dim(samplesB)[2]]
row.names(mA)<- samplesA[,1]
row.names(mB)<- samplesB[,1]
mA<- as.matrix(mA)
mB<- as.matrix(mB)

Look at the information related to each sample including the name of the studies, types of platform, treatment, and tissue:

info_samplesA<- read.table("info_10data_sampleA.txt", sep="\t", header=T)
info_samplesA
info_samplesB<- read.table("info_10data_sampleB.txt", sep="\t", header=T)
info_samplesB

2 Assessment of unwanted variation in the data

Here we perform some exploratory analysis on the integrated data to assess the presence of unwanted variation in each dataset.

2.1 RLE plot

We start by looking at the RLE plots in samples A data, coloured by study, platform and tissue:

## Transpose the expression matrix so that we have genes in columns and samples in rows
YA <- t(mA)
## Plot RLE coloured by study
ruv_rle(YA, info_samplesA$study, ylim = c(-4,4))

## Plot RLE coloured by platform
ruv_rle(YA, info_samplesA$platform, ylim = c(-4,4))

## Plot RLE coloured by platform
ruv_rle(YA, info_samplesA$tissue, ylim = c(-4,4))

Similarly, we look at the RLE plots in sample B data coloured by study, platform and tissue:

YB <- t(mB)
## Plot RLE coloured by study
ruv_rle(YB, info_samplesB$study, ylim = c(-4,4))
## Plot RLE coloured by platform
ruv_rle(YB, info_samplesB$platform, ylim = c(-4,4))

## Plot RLE coloured by tissue
ruv_rle(YB, info_samplesB$tissue, ylim = c(-4,4))

2.2 PCA plot

In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. To assess the presence of unwanted variation in each sample, we use PCA plots to show similarities between samples measured in an unsupervised way. Ideally, samples should cluster together according to the treatment (i.e. the biological factor of interest). Here, we see that samples are rather clustered by studies (i.e. unwanted variation) in both samples A and B data.
In the current example, it’s important to note that in some cases, different studies are confounded with different platforms and tissues, and therefore there is no way to identify how much of the unwanted variation come from each of these factors. Such situations must be avoided when designing an experiment and for the purpose of this tutorial, we only consider “study” as the source of unwanted variation.

3 Remove batch effects using RUV methods

There are several RUV methods for removing unwanted variation in order to obtain DEGs; these include RUV-2, RUV-4, RUV-inv and RUV-rinv. In general, RUV methods are dependent on negative control genes (genes which are not associated with the biological factor of interest) and replicate samples (if applicable). RUV-2 removes unwanted variation in two steps. RUV-4 came after RUV-2 and has four steps. For RUV-4 we can estimate the dimension of unwanted variation (k) or select different values for k, while for RUV-inv and RUV-rinv we don’t need to estimate k as it is set to be the maximum value. In general, RUV-inv and RUV-rinv are better than RUV-4. RUV-inv is recommended when we have large number of control genes (~1000), while RUV-rinv is more appropriate with small number of control genes (~60).
Selection of negative control genes is very important. Examples of negative control genes are the spike-in controls or the housekeeping (HK) genes. It is also possible to define empirical negative control genes using an iterative approach. However, it is only recomended if (i) the initial negative control genes are not very good or there are only a few of them; (ii) the beta seems to be very sparse. Therefore, the user needs to generate diagnostic plots to assess the performance of the initial analysis, and only if needed, use an iterative approach to define better control genes.

3.1 RUVinv

Here, we first focus on RUV-inv, which is performed by RUVinv() function. This function take expression matrix (Y), biological factor of interest (X), and a vector for indices of negative control genes (ctl).

3.1.1 Apply RUVinv using house-keeping genes

We begin with the list of housekeeping (HK) genes as our negative controls.

rr

# library(ruv)
HKgenes<- read.table(\HouseKeeping_genes_IDs.txt\, header=T, sep=\\t\)
hk<- HKgenes$GeneID
matA<- t(mA)
matB<- t(mB)
ctrl<- colnames(matA) %in% hk

We run RUVinv using HK genes on samples A data.

Check if HK genes as negative control genes in the initial analysis were useful.

## Look at the distribution of p-values
ruv_hist(fit_ruvin_hk_samplesA.summary)

## Look at the volcano plot
genecoloring <- list(aes(color = Categ),
                     scale_color_manual(name = "Gene Category",
                                        values = alpha(c("Red", "black"),
                                                     c( 0.25, 0.1))))
ruv_volcano(fit_ruvin_hk_samplesA.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_hk_samplesA.summary) + genecoloring

Similarly, we run RUVinv using HK genes on samples B data.

Check if HK genes as negative control genes in the initial analysis were useful.

## Look at the distribution of p-values
ruv_hist(fit_ruvin_hk_samplesB.summary)

## Look at the volcano plot
genecoloring <- list(aes(color = fit.ctl),
                     scale_color_manual(name = "Gene Category",
                                        values = alpha(c("Red", "black"),
                                                     c( 0.1, 0.25))))
ruv_volcano(fit_ruvin_hk_samplesB.summary) + genecoloring

genecoloring <- list(aes(color = Categ),
                     scale_color_manual(name = "Gene Category",
                                        values = alpha(c("Red", "black"),
                                                     c( 0.25, 0.1))))
ruv_volcano(fit_ruvin_hk_samplesB.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_hk_samplesB.summary) + genecoloring

3.1.2 Apply RUVinv using empirical control genes

According to the above results, some of the HK genes seem to be differentially expressed. In these cases, it is often recomended to use RUV4 as we can manually change the k, where smaller k may retain more of the biological signal and larger k may overadjust the data. For now, we continue with RUVinv to show how to define empirical negative control genes, by selecting those genes that were not statistically significant in the results of teh initital analysis. Next, we show the use of RUV4 on these data.

## Selecting empirical negative controls genes
cGenes_samplesA <- row.names(fit_ruvin_hk_samplesA.summary$C)[fit_ruvin_hk_samplesA.summary$C$F.p.BH > 0.05]  
length(cGenes_samplesA)
[1] 7916
empCtrl_ruvinv_samplesA <- colnames(YA) %in% cGenes_samplesA
Gene_Category <- empCtrl_ruvinv_samplesA
Gene_Category[empCtrl_ruvinv_samplesA == T] <- "EmpCtl"
Gene_Category[empCtrl_ruvinv_samplesA == F] <- "Non_EmpCtl"
Emp_Ct_Genes_samplesA <- data.frame(geneids = colnames(YA),
                                    empCt_RUV = empCtrl_ruvinv_samplesA,
                                    Categ = Gene_Category)
head(Emp_Ct_Genes_samplesA)

And similarly for sample B:

## Select empirical negative controls genes
cGenes_samplesB <- row.names(fit_ruvin_hk_samplesB.summary$C)[fit_ruvin_hk_samplesB.summary$C$F.p.BH > 0.05]
length(cGenes_samplesB)
[1] 8919
empCtrl_ruvinv_samplesB<- colnames(YB) %in% cGenes_samplesB
Gene_Category <- empCtrl_ruvinv_samplesB
Gene_Category[empCtrl_ruvinv_samplesB==T]="EmpCtl"
Gene_Category[empCtrl_ruvinv_samplesB==F]="Non_EmpCtl"
Emp_Ct_Genes_samplesB <- data.frame(geneids = colnames(matB),
                                    empCt_RUV = empCtrl_ruvinv_samplesB,
                                    Categ = Gene_Category)
head(Emp_Ct_Genes_samplesB)

Now we use those empirical control genes to apply RUVinv in samples A and B data sets.

##---- In samples A data:
fit_ruvin_emp_samplesA <- RUVinv(Y = YA, X = gA, 
                                 ctl = Emp_Ct_Genes_samplesA$empCt_RUV,
                                 Z = 1, fullW0 = NULL, 
                                 lambda = NULL, iterN = 100000)
fit_ruvin_emp_samplesA.summary <- ruv_summary(YA, fit_ruvin_emp_samplesA, 
                                             info_samplesA,              ## row info
                                             Emp_Ct_Genes_samplesA)      ## col info
##---- In samples B data: 
fit_ruvin_emp_samplesB <- RUVinv(Y = YB, X = gB, 
                                ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                                Z = 1, fullW0 = NULL, 
                                lambda = NULL, iterN = 100000)
fit_ruvin_emp_samplesB.summary <- ruv_summary(YB, fit_ruvin_emp_samplesB, 
                                             info_samplesB, 
                                             Emp_Ct_Genes_samplesB)

Here, we look at the statistics obtained for samples A data.

## Look at the distribution of p-values
ruv_hist(fit_ruvin_emp_samplesA.summary)

## Look at the Volcano plot
genecoloring <- list(aes(color = Categ), 
                    scale_color_manual(name = "Gene Category",
                                       values = alpha(c("Black", "red"),
                                                      c( 0.1, 0.25))))
ruv_volcano(fit_ruvin_emp_samplesA.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_emp_samplesA.summary) + genecoloring

Now we look at the statistics obtained for samples B data.

## Look at the distribution of p-values
ruv_hist(fit_ruvin_emp_samplesB.summary)

## Look at the volcano plot
genecoloring <- list(aes(color = Categ),
                    scale_color_manual(name = "Gene Category",
                                       values = alpha(c("Black","Red"),
                                                    c( 0.2, 0.2))))
ruv_volcano(fit_ruvin_emp_samplesB.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_emp_samplesB.summary) + genecoloring

3.2 RUV4

RUV4 can be run with different values of k in an attempt to find the optimal value. Note that although there is a function to estimate k, called getK(), this may not give the optimal value for k and is often recomended to be used when there is no other choice but to automate finding K (e.g. in simulations).

3.2.1 Apply RUV4 using empirical control genes

Run RUV4 using different values of k and emprical controls.


## Instead of estimating k (commented below), we look at different values for k.
# estimateK<- getK(YA, X=gA,
#                  ctl=Emp_Ct_Genes_samplesA$empCt_RUV,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kA <- estimateK$k   ## it will be k = 11

# estimateK<- getK(YB, X=gB,
#                  ctl=Emp_Ct_Genes_samplesB$empCt_RUV,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kB <- estimateK$k   ## it will be k = 8

ks <- c(1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)
## For k > 24 I got Error:
# NaNs producedNaNs producedError in sigmashrink(fit$sigma2, fit$df) : 
#   NA/NaN/Inf in foreign function call (arg 1)

beta_corAB <- vector()

for (K in ks){
  fit_ruv4_emp_sampleA <- RUV4(YA, X = gA, 
                            ctl = Emp_Ct_Genes_samplesA$empCt_RUV, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_emp_sampleA.summary <- ruv_summary(YA,
                                           fit_ruv4_emp_sampleA,
                                           info_samplesA,
                                           Emp_Ct_Genes_samplesA)
  
  fit_ruv4_emp_sampleB <- RUV4(YB, X = gB, 
                            ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_emp_sampleB.summary <- ruv_summary(YB,
                                           fit_ruv4_emp_sampleB,
                                           info_samplesB,
                                           Emp_Ct_Genes_samplesB)
  
  currentCor <- cor.test(fit_ruv4_emp_sampleA$betahat,
                        fit_ruv4_emp_sampleB$betahat)$estimate
  
  beta_corAB <- c(beta_corAB, currentCor)
  
}
names(beta_corAB) <- ks
beta_corAB ## K = 23 seems a good choice

3.2.2 Apply RUV4 using house-keeping genes


## Instead of estimating k (commented below), we look at different values for k.
# estimateK<- getK(YA, X = gA,
#                  ctl = ctrl,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kA <- estimateK$k   ## it will be k = 7
# 
# estimateK<- getK(YB, X = gB,
#                  ctl = ctrl,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kB <- estimateK$k   ## it will be k = 6

ks <- c(1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)
## For k > 24 I got Error:
# NaNs producedNaNs producedError in sigmashrink(fit$sigma2, fit$df) : 
#   NA/NaN/Inf in foreign function call (arg 1)

beta_corAB_HK <- vector()

for (K in ks){
  fit_ruv4_hk_sampleA <- RUV4(YA, X = gA, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_sampleA.summary <- ruv_summary(YA,
                                           fit_ruv4_hk_sampleA,
                                           info_samplesA)
  
  fit_ruv4_hk_sampleB <- RUV4(YB, X = gB, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_sampleB.summary <- ruv_summary(YB,
                                           fit_ruv4_hk_sampleB,
                                           info_samplesB)
  
  currentCor <- cor.test(fit_ruv4_hk_sampleA$betahat,
                        fit_ruv4_hk_sampleB$betahat)$estimate
  
  beta_corAB_HK <- c(beta_corAB_HK, currentCor)
  
}
names(beta_corAB_HK) <- ks
beta_corAB_HK ## K = 23 seems a good choice

As k = 23 results in the highest correlation between samples A and B data sets, we consider that as the optimal value, and we run RUV4 using k = 23.


##------ If using EmpCtrl:
K = 23
fit_ruv4_emp_sampleA <- RUV4(YA, X = gA, 
                          ctl = Emp_Ct_Genes_samplesA$empCt_RUV, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_emp_sampleA.summary <- ruv_summary(YA,
                                         fit_ruv4_emp_sampleA,
                                         info_samplesA,
                                         Emp_Ct_Genes_samplesA)

fit_ruv4_emp_sampleB <- RUV4(YB, X = gB, 
                          ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_emp_sampleB.summary <- ruv_summary(YB,
                                         fit_ruv4_emp_sampleB,
                                         info_samplesB,
                                         Emp_Ct_Genes_samplesB)

##------- If using HK genes:
fit_ruv4_hk_sampleA <- RUV4(YA, X = gA, 
                          ctl = ctrl, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_hk_sampleA.summary <- ruv_summary(YA,
                                         fit_ruv4_hk_sampleA,
                                         info_samplesA)

fit_ruv4_hk_sampleB <- RUV4(YB, X = gB, 
                          ctl = ctrl, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_hk_sampleB.summary <- ruv_summary(YB,
                                         fit_ruv4_hk_sampleB,
                                         info_samplesB)

4 Comparison of results of unadjusted, RUVinv- and RUV4-adjusted data

In order to see if we have helped or not, we run differential expression analysis on the unadjusted data. Then we compare the results of RUVinv, RUV4 and unadjusted together.

4.1 Unadjusted data

We can run RUV4 with k = 0 to do no adjustment when obtaining DEGs.

4.1.1 Apply DE analysis in the unadjusted data using HK genes

# RUV4 with k = 0 for no adjustment
# Equivalent to a Limma Analysis without considering the batch term
##----- In sample A data
fit_unadj_hk_sampleA <- RUV4(YA, X = gA, 
                          ctl = ctrl, 
                          k = 0)
fit_unadj_hk_sampleA.summary <- ruv_summary(YA, 
                                        fit_unadj_hk_sampleA,
                                        info_samplesA)
##----- In sample B data
fit_unadj_hk_sampleB <- RUV4(YB, X = gB, 
                          ctl = ctrl,
                          k = 0)
fit_unadj_hk_sampleB.summary <- ruv_summary(YB, 
                                         fit_unadj_hk_sampleB,
                                         info_samplesB)

4.1.2 Apply DE analysis in the unadjusted data using empirical control genes

# RUV4 with k = 0 for no adjustment
# Equivalent to a Limma Analysis without considering the batch term
##----- In sample A data
fit_unadj_emp_sampleA <- RUV4(YA, X = gA, 
                          ctl = Emp_Ct_Genes_samplesA$empCt_RUV, 
                          k = 0)
fit_unadj_emp_sampleA.summary <- ruv_summary(YA, 
                                        fit_unadj_emp_sampleA,
                                        info_samplesA, 
                                        Emp_Ct_Genes_samplesA)
##----- In sample B data
fit_unadj_emp_sampleB <- RUV4(YB, X = gB, 
                          ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                          k = 0)
fit_unadj_emp_sampleB.summary <- ruv_summary(YB, 
                                         fit_unadj_emp_sampleB,
                                         info_samplesB,
                                         Emp_Ct_Genes_samplesB)

4.2 P-values distribution

We compare the results obtained from unajusted, RUVinv- and RUV4- adjusted data using p-value distributions, correlations of beta values and venn diagrams in samples A and B data.

4.3 Betahat correlation

For each of the unadjusted, RUVinv- and RUV4- adjusted settings, we can compare the results between samples A and B data sets to see which method gives more consistent resulst in these two data sets. To quantify these consistencies, we look at the correlations between betahat from sample A and betahat from sample B for the unadjusted method, RUVinv and RUV4.

##------ Unadjusted data sets
plot(fit_unadj_emp_sampleA$betahat, 
     fit_unadj_emp_sampleB$betahat,
     xlab = "Betahat Samples A",
     ylab = "Betahat Samples B",
     main = "Unadjusted",
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
# abline(fit_unadj_emp_sampleB$betahat,fit_unadj_emp_sampleA$betahat)
corVal <- cor.test(fit_unadj_emp_sampleA$betahat, fit_unadj_emp_sampleB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))

##------ RUVinv adjusted data sets
plot(fit_ruvin_emp_samplesA$betahat,
     fit_ruvin_emp_samplesB$betahat,
     xlab = "Betahat Samples A",
     ylab = "Betahat Samples B",
     main = "RUVinv",
     xlim = c(-3,3), cex=0.3, ylim=c(-4,4))
#abline(fit_ruvin_emp_samplesB$betahat,fit_ruvin_emp_samplesA$betahat)
corVal <- cor.test(fit_ruvin_emp_samplesA$betahat, fit_ruvin_emp_samplesB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))

#------- RUV4 adjusted data sets
plot(fit_ruv4_emp_sampleA$betahat,
     fit_ruv4_emp_sampleB$betahat,
     xlab = "Betahat Samples A",
     ylab = "Betahat Samples B",
     main = "RUV4",
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
#abline(fit_ruv4_emp_sampleB$betahat,fit_ruv4_emp_sampleA$betahat)
corVal <- cor.test(fit_ruv4_emp_sampleA$betahat, fit_ruv4_emp_sampleB$betahat)$estimate
text(-3,3, pos=4, paste("Correlation: ", round(corVal,2),sep=""))

4.4 Overlap between differentially expressed genes

Finally, for each of the unadjusted, RUVinv and RUV4, we look at the number of overlapping differentially expressed genes (DEGs) between the two samples A and B data sets. We also check for consistency between methods on the same data (sample A or sample B data set).
First, we define DEGs as genes with adjusted p-value < 0.05 and |logFC| > 1 in samples A and B data sets which are unadjusted, RUV4- or RUVinv-adjusted.

DEGsUnadj_sampleA <- row.names(fit_unadj_emp_sampleA.summary$C)
[fit_unadj_emp_sampleA.summary$C$F.p.BH < 0.05 &
Error: unexpected '[' in "["

4.4.1 DEGs in the unadjusted data

Venn diagram comparing the unadjusted samples A and B data.

allDEGs_unadj <- c(DEGsUnadj_sampleA, 
                    DEGsUnadj_sampleB)
## remove duplicated gene symbols:
allDEGs_unadj <- allDEGs_unadj[!duplicated(allDEGs_unadj)]  
## Draw a Venn diagram comparing DEGs for RUVinv
Counts_unadj <- matrix(0, nrow = length(allDEGs_unadj), ncol = 2)
row.names(Counts_unadj)<- allDEGs_unadj
colnames(Counts_unadj)<- c("Unadj_A","Unadj_B")
for( i in 1:length(allDEGs_unadj)) {
  Counts_unadj[i,1]<- allDEGs_unadj[i] %in% DEGsUnadj_sampleA
  Counts_unadj[i,2]<- allDEGs_unadj[i] %in% DEGsUnadj_sampleB
}
col <- c("blue", "violet")
vennDiagram(vennCounts(Counts_unadj), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd=2)

4.4.2 DEGs in the RUVinv-adjusted data

Venn diagram comparing the RUVinv-adjusted samples A and B data.

allDEGs_RUVinv <- c(DEGsRUVinv_sampleA, 
                    DEGsRUVinv_sampleB)
## remove duplicated gene symbols:
allDEGs_RUVinv <- allDEGs_RUVinv[!duplicated(allDEGs_RUVinv)]  
## Draw a Venn diagram comparing DEGs for RUVinv
Counts_RUVinv <- matrix(0, nrow = length(allDEGs_RUVinv), ncol = 2)
row.names(Counts_RUVinv)<- allDEGs_RUVinv
colnames(Counts_RUVinv)<- c("RUVinv_A","RUVinv_B")
for( i in 1:length(allDEGs_RUVinv)) {
  Counts_RUVinv[i,1]<- allDEGs_RUVinv[i] %in% DEGsRUVinv_sampleA
  Counts_RUVinv[i,2]<- allDEGs_RUVinv[i] %in% DEGsRUVinv_sampleB
}
col<- c("blue", "violet")
vennDiagram(vennCounts(Counts_RUVinv), 
            circle.col = col,
            cex = c(1.6, 1.2, 1), lwd = 2)

4.4.3 DEGs in the RUV4-adjusted data

Venn diagram comparing the RUV4-adjusted samples A and B data.

allDEGs_RUV4<- c(DEGsRUV4_sampleA, DEGsRUV4_sampleB)
## remove duplicated gene symbols:
allDEGs_RUV4<- allDEGs_RUV4[!duplicated(allDEGs_RUV4)]  
## Draw a Venn diagram comparing DEGs for RUV4
Counts_RUV4 <- matrix(0, nrow= length(allDEGs_RUV4), ncol=2)
row.names(Counts_RUV4)<- allDEGs_RUV4
colnames(Counts_RUV4)<- c("RUV4_A","RUV4_B")
for( i in 1:length(allDEGs_RUV4)) {
  Counts_RUV4[i,1]<- allDEGs_RUV4[i] %in% DEGsRUV4_sampleA
  Counts_RUV4[i,2]<- allDEGs_RUV4[i] %in% DEGsRUV4_sampleB
}
col<- c("blue", "violet")
vennDiagram(vennCounts(Counts_RUV4), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)

4.4.4 Compare the unadjusted, RUV4 and RUVinv-adjusted in samples A data

4.4.5 Compare the unadjusted, RUV4 and RUVinv-adjusted in samples B data

allDEGs_sampleB <- c(DEGsUnadj_sampleB,
                     DEGsRUVinv_sampleB,
                     DEGsRUV4_sampleB)
## remove duplicated gene symbols:
allDEGs_sampleB <- allDEGs_sampleB[!duplicated(allDEGs_sampleB)] 
## Draw a Venn diagram comparing DEGs for sample B
Counts_sampleB <- matrix(0, nrow = length(allDEGs_sampleB), ncol = 3)
row.names(Counts_sampleB) <- allDEGs_sampleB
colnames(Counts_sampleB) <- c("Unadj_B", "RUVinv_B","Ruv4_B")
for( i in 1:length(allDEGs_sampleB)) {
  Counts_sampleB[i,1]<- allDEGs_sampleB[i] %in% DEGsUnadj_sampleB
  Counts_sampleB[i,2]<- allDEGs_sampleB[i] %in% DEGsRUVinv_sampleB
  Counts_sampleB[i,3]<- allDEGs_sampleB[i] %in% DEGsRUV4_sampleB
}
col <- c("blue", "violet", "darkgreen")
vennDiagram(vennCounts(Counts_sampleB), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)

---
title: "How to remove unwanted variation from transcriptomics data using RUVinv and RUV4"
author: Sepideh Foroutan and Marie Trussart
output:
  github_document:
    toc: yes
  html_notebook:
    number_sections: yes
    toc: yes
    toc_float: yes
    code_folding: "hide"
    fig_caption: yes
---


# Data description
In this tutorial, we aim to obtain differentially expressed genes between two biological conditions in an integrated data. This integrated data set is generated by combining 10 microarray studies on control and TGFb-treated samples (for more information, see here http://mcr.aacrjournals.org/content/15/5/619).\ 

First, we assess the presence of unwanted variation in the integrated data which combines differents studies and platforms. Second, we show how to use two RUV methods: RUVinv and RUV4 to remove unwanted variation and detect differentially expressed genes comparing control samples to TGFb-treated samples. Third, we assess whether RUV methods are useful and compare the results obtained by each method.
```{r load-libs}
library(ruv)            ## for applying RUV methods
library(limma)          ## for vennDiagram()
library(ggplot2)        ## for data visualisation
```

The integrated data introduced above have been split into two data sets: samples A and samples B, each containing different studies, platforms and tissues. We will explore and normalise these two data sets separately in order to compare the results obtained by the two normalisation methods (RUV4 and RUVinv).

Read in the two integrated datasets.
```{r load-data}
samplesA<- read.table("expr_10data_sampleA.txt", header = T, sep = "\t")
samplesB<- read.table("expr_10data_sampleB.txt", header = T, sep = "\t")
```

Look at the data in each samples A and B, then make a matrix for each these data where the row names are gene IDs.
```{r explore-data}

head(samplesA,3)
head(samplesB,3)
mA<-samplesA[,2:dim(samplesA)[2]]
mB<-samplesB[,2:dim(samplesB)[2]]
row.names(mA)<- samplesA[,1]
row.names(mB)<- samplesB[,1]
mA<- as.matrix(mA)
mB<- as.matrix(mB)
```

Look at the information related to each sample including the name of the studies, types of platform, treatment, and tissue:
```{r load-info-study}
info_samplesA<- read.table("info_10data_sampleA.txt", sep="\t", header=T)
info_samplesA
info_samplesB<- read.table("info_10data_sampleB.txt", sep="\t", header=T)
info_samplesB
```

# Assessment of unwanted variation in the data
Here we perform some exploratory analysis on the integrated data to assess the presence of unwanted variation in each dataset.

## RLE plot
We start by looking at the RLE plots in samples A data, coloured by study, platform and tissue:
```{r rleplots-samplesA}
## Transpose the expression matrix so that we have genes in columns and samples in rows
YA <- t(mA)
## Plot RLE coloured by study
ruv_rle(YA, info_samplesA$study, ylim = c(-4,4))
## Plot RLE coloured by platform
ruv_rle(YA, info_samplesA$platform, ylim = c(-4,4))
## Plot RLE coloured by platform
ruv_rle(YA, info_samplesA$tissue, ylim = c(-4,4))
```

Similarly, we look at the RLE plots in sample B data coloured by study, platform and tissue:
```{r rleplots-samplesB}
YB <- t(mB)
## Plot RLE coloured by study
ruv_rle(YB, info_samplesB$study, ylim = c(-4,4))
## Plot RLE coloured by platform
ruv_rle(YB, info_samplesB$platform, ylim = c(-4,4))
## Plot RLE coloured by tissue
ruv_rle(YB, info_samplesB$tissue, ylim = c(-4,4))
```

## PCA plot
In transcriptomics applications, one of the most utilized exploratory plots is the multi-dimensional scaling (MDS) plot or a principal component analysis (PCA) plot. To assess the presence of unwanted variation in each sample, we use PCA plots to show similarities between samples measured in an unsupervised way. Ideally, samples should cluster together according to the treatment (i.e. the biological factor of interest). Here, we see that samples are rather clustered by studies (i.e. unwanted variation) in both samples A and B data.\
In the current example, it's important to note that in some cases, different studies are confounded with different platforms and tissues, and therefore there is no way to identify how much of the unwanted variation come from each of these factors. Such situations must be avoided when designing an experiment and for the purpose of this tutorial, we only consider "study" as the source of unwanted variation.\

```{r mds-plot-sampleA}
gg_additions <- list(aes(color = info_samplesA$study, 
                         shape = info_samplesA$treatment, 
                         size = 5, alpha = .7),
                     labs(color = "Study", shape="Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width = 8, repr.plot.height = 6)
ruv_svdplot(YA) + gg_additions 
```

```{r mds-plot-sampleB}
gg_additions <- list(aes(color = info_samplesB$study,
                         shape = info_samplesB$treatment,
                         size = 5, alpha = .7),
                     labs(color = "Study",shape = "Treatment"),
                     scale_size_identity(guide = "none"),
                     scale_alpha(guide = "none"),
                     theme(legend.text = element_text(size = 12),
                           legend.title = element_text(size = 16)),
                     guides(color = guide_legend(override.aes = list(size = 4)),
                            shape = guide_legend(override.aes = list(size = 4))))
options(repr.plot.width=8, repr.plot.height=6)
ruv_svdplot(YB) + gg_additions #
```


# Remove batch effects using RUV methods
There are several RUV methods for removing unwanted variation in order to obtain DEGs; these include RUV-2, RUV-4, RUV-inv and RUV-rinv. In general, RUV methods are dependent on negative control genes (genes which are not associated with the biological factor of interest) and replicate samples (if applicable). RUV-2 removes unwanted variation in two steps. RUV-4 came after RUV-2 and has four steps. For RUV-4 we can estimate the dimension of unwanted variation (k) or select different values for k, while for RUV-inv and RUV-rinv we don't need to estimate k as it is set to be the maximum value. In general, RUV-inv and RUV-rinv are better than RUV-4. RUV-inv is recommended when we have large number of control genes (~1000), while RUV-rinv is more appropriate with small number of control genes (~60).\

Selection of negative control genes is very important. Examples of negative control genes are the spike-in controls or the housekeeping (HK) genes. It is also possible to define empirical negative control genes using an iterative approach. However, it is only recomended if (i) the initial negative control genes are not very good or there are only a few of them; (ii) the beta seems to be very sparse. Therefore, the user needs to generate diagnostic plots to assess the performance of the initial analysis, and only if needed, use an iterative approach to define better control genes.

## RUVinv

Here, we first focus on RUV-inv, which is performed by **RUVinv()** function. This function take expression matrix (Y), biological factor of interest (X), and a vector for indices of negative control genes (ctl).

### Apply RUVinv using house-keeping genes
We begin with the list of housekeeping (HK) genes as our negative controls.
```{r}
HKgenes <- read.table("HouseKeeping_genes_IDs.txt", header=T, sep="\t")
hk <- HKgenes$GeneID
ctrl <- colnames(YA) %in% hk
```

We run RUVinv using HK genes on samples A data.
```{r initial-analysis-hk-genes-samplesA}
## Take treatment as the biological factor of interest
groups_A <- factor(info_samplesA$treatment) 
gA <- cbind(as.numeric(groups_A))   ## 1 control, 2: TGFb


## Apply RUV-inv using the housekeeping genes as negative controls 
fit_ruvin_hk_samplesA <- RUVinv(YA, gA, ctrl)
fit_ruvin_hk_samplesA.summary <- ruv_summary(YA, 
                                             fit_ruvin_hk_samplesA, 
                                             info_samplesA)
head(fit_ruvin_hk_samplesA.summary$C)
```

Check if HK genes as negative control genes in the initial analysis were useful.
```{r diagnostic-plots-intial-analysis-samplesA}
## Look at the distribution of p-values
ruv_hist(fit_ruvin_hk_samplesA.summary)

## Look at the volcano plot
genecoloring <- list(aes(color = fit.ctl),
                     scale_color_manual(name = "Gene Category",
                                        values = alpha(c("Red", "black"),
                                                     c( 0.1, 0.25))))
ruv_volcano(fit_ruvin_hk_samplesA.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_hk_samplesA.summary) + genecoloring
```

Similarly, we run RUVinv using HK genes on samples B data.
```{r initial-analysis-hk-genes-samplesB}
## Take treatment as the biological factor of interest
groups_B <- factor(info_samplesB$treatment) 
gB <- cbind(as.numeric(groups_B))   ## 1 control, 2: TGFb

## Apply RUV-inv using the housekeeping genes as negative controls 
fit_ruvin_hk_samplesB <- RUVinv(YB, gB, ctrl)
fit_ruvin_hk_samplesB.summary <- ruv_summary(YB, 
                                             fit_ruvin_hk_samplesB, 
                                             info_samplesB)

head(fit_ruvin_hk_samplesB.summary$C)
```

Check if HK genes as negative control genes in the initial analysis were useful.
```{r diagnostic-plots-intial-analysis-samplesB}
## Look at the distribution of p-values
ruv_hist(fit_ruvin_hk_samplesB.summary)

## Look at the volcano plot
genecoloring <- list(aes(color = fit.ctl),
                     scale_color_manual(name = "Gene Category",
                                        values = alpha(c("Red", "black"),
                                                     c( 0.1, 0.25))))
ruv_volcano(fit_ruvin_hk_samplesB.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_hk_samplesB.summary) + genecoloring
```

### Apply RUVinv using empirical control genes
According to the above results, some of the HK genes seem to be differentially expressed. In these cases, it is often recomended to use RUV4 as we can manually change the k, where smaller k may retain more of the biological signal and larger k may overadjust the data. For now, we continue with RUVinv to show how to define **empirical negative control genes**, by selecting those genes that were not statistically significant in the results of teh initital analysis. Next, we show the use of RUV4 on these data. 

```{r ruv-inv-samplesA}
## Selecting empirical negative controls genes
cGenes_samplesA <- row.names(fit_ruvin_hk_samplesA.summary$C)[fit_ruvin_hk_samplesA.summary$C$F.p.BH > 0.05]  
length(cGenes_samplesA)

empCtrl_ruvinv_samplesA <- colnames(YA) %in% cGenes_samplesA

Gene_Category <- empCtrl_ruvinv_samplesA
Gene_Category[empCtrl_ruvinv_samplesA == T] <- "EmpCtl"
Gene_Category[empCtrl_ruvinv_samplesA == F] <- "Non_EmpCtl"
Emp_Ct_Genes_samplesA <- data.frame(geneids = colnames(YA),
                                    empCt_RUV = empCtrl_ruvinv_samplesA,
                                    Categ = Gene_Category)
head(Emp_Ct_Genes_samplesA)

```
 And similarly for sample B:
```{r ruv-inv-samplesB}
## Select empirical negative controls genes
cGenes_samplesB <- row.names(fit_ruvin_hk_samplesB.summary$C)[fit_ruvin_hk_samplesB.summary$C$F.p.BH > 0.05]
length(cGenes_samplesB)

empCtrl_ruvinv_samplesB<- colnames(YB) %in% cGenes_samplesB

Gene_Category <- empCtrl_ruvinv_samplesB
Gene_Category[empCtrl_ruvinv_samplesB==T]="EmpCtl"
Gene_Category[empCtrl_ruvinv_samplesB==F]="Non_EmpCtl"
Emp_Ct_Genes_samplesB <- data.frame(geneids = colnames(YB),
                                    empCt_RUV = empCtrl_ruvinv_samplesB,
                                    Categ = Gene_Category)

head(Emp_Ct_Genes_samplesB)

```

Now we use those empirical control genes to apply RUVinv in samples A and B data sets.
```{r emp-ruvinv}
##---- In samples A data:
fit_ruvin_emp_samplesA <- RUVinv(Y = YA, X = gA, 
                                 ctl = Emp_Ct_Genes_samplesA$empCt_RUV,
                                 Z = 1, fullW0 = NULL, 
                                 lambda = NULL, iterN = 100000)

fit_ruvin_emp_samplesA.summary <- ruv_summary(YA, fit_ruvin_emp_samplesA, 
                                             info_samplesA,              ## row info
                                             Emp_Ct_Genes_samplesA)      ## col info

##---- In samples B data: 
fit_ruvin_emp_samplesB <- RUVinv(Y = YB, X = gB, 
                                ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                                Z = 1, fullW0 = NULL, 
                                lambda = NULL, iterN = 100000)

fit_ruvin_emp_samplesB.summary <- ruv_summary(YB, fit_ruvin_emp_samplesB, 
                                             info_samplesB, 
                                             Emp_Ct_Genes_samplesB)
```

Here, we look at the statistics obtained for samples A data.
```{r emp-ruvinv-sampleA-results}
## Look at the distribution of p-values
ruv_hist(fit_ruvin_emp_samplesA.summary)

## Look at the Volcano plot
genecoloring <- list(aes(color = Categ), 
                    scale_color_manual(name = "Gene Category",
                                       values = alpha(c("Black", "red"),
                                                      c( 0.1, 0.25))))
ruv_volcano(fit_ruvin_emp_samplesA.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_emp_samplesA.summary) + genecoloring
```


Now we look at the statistics obtained for samples B data.
```{r emp-ruvinv-sampleB-results}
## Look at the distribution of p-values
ruv_hist(fit_ruvin_emp_samplesB.summary)

## Look at the volcano plot
genecoloring <- list(aes(color = Categ),
                    scale_color_manual(name = "Gene Category",
                                       values = alpha(c("Black","Red"),
                                                    c( 0.2, 0.2))))
ruv_volcano(fit_ruvin_emp_samplesB.summary) + genecoloring

## Look at ECDF of p-values
ruv_ecdf(fit_ruvin_emp_samplesB.summary) + genecoloring
```

## RUV4
RUV4 can be run with different values of k in an attempt to find the optimal value. Note that although there is a function to estimate k, called **getK()**, this may not give the optimal value for k and is often recomended to be used when there is no other choice but to automate finding K (e.g. in simulations). 

### Apply RUV4 using empirical control genes
Run RUV4 using different values of k and emprical controls.
``` {r ruv4-Empirical-differentKs}

## Instead of estimating k (commented below), we look at different values for k.
# estimateK<- getK(YA, X=gA,
#                  ctl=Emp_Ct_Genes_samplesA$empCt_RUV,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kA <- estimateK$k   ## it will be k = 11

# estimateK<- getK(YB, X=gB,
#                  ctl=Emp_Ct_Genes_samplesB$empCt_RUV,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kB <- estimateK$k   ## it will be k = 8

ks <- c(1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)
## For k > 24 I got Error:
# NaNs producedNaNs producedError in sigmashrink(fit$sigma2, fit$df) : 
#   NA/NaN/Inf in foreign function call (arg 1)

beta_corAB <- vector()

for (K in ks){
  fit_ruv4_emp_sampleA <- RUV4(YA, X = gA, 
                            ctl = Emp_Ct_Genes_samplesA$empCt_RUV, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_emp_sampleA.summary <- ruv_summary(YA,
                                           fit_ruv4_emp_sampleA,
                                           info_samplesA,
                                           Emp_Ct_Genes_samplesA)
  
  fit_ruv4_emp_sampleB <- RUV4(YB, X = gB, 
                            ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_emp_sampleB.summary <- ruv_summary(YB,
                                           fit_ruv4_emp_sampleB,
                                           info_samplesB,
                                           Emp_Ct_Genes_samplesB)
  
  currentCor <- cor.test(fit_ruv4_emp_sampleA$betahat,
                        fit_ruv4_emp_sampleB$betahat)$estimate
  
  beta_corAB <- c(beta_corAB, currentCor)
  
}
names(beta_corAB) <- ks
beta_corAB ## K = 23 seems a good choice
```

### Apply RUV4 using house-keeping genes
``` {r ruv4-HK-differentKs}

## Instead of estimating k (commented below), we look at different values for k.
# estimateK<- getK(YA, X = gA,
#                  ctl = ctrl,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kA <- estimateK$k   ## it will be k = 7
# 
# estimateK<- getK(YB, X = gB,
#                  ctl = ctrl,
#                  Z = 1, eta = NULL, fullW0 = NULL, cutoff = NULL,
#                  method="select", l=1, inputcheck = TRUE)
# kB <- estimateK$k   ## it will be k = 6

ks <- c(1, 2, 5, 6, 7, 8, 10, 11, 12, 15, 18, 20, 22, 23, 24)
## For k > 24 I got Error:
# NaNs producedNaNs producedError in sigmashrink(fit$sigma2, fit$df) : 
#   NA/NaN/Inf in foreign function call (arg 1)

beta_corAB_HK <- vector()

for (K in ks){
  fit_ruv4_hk_sampleA <- RUV4(YA, X = gA, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_sampleA.summary <- ruv_summary(YA,
                                           fit_ruv4_hk_sampleA,
                                           info_samplesA)
  
  fit_ruv4_hk_sampleB <- RUV4(YB, X = gB, 
                            ctl = ctrl, 
                            k = K,Z = 1, eta = NULL, 
                            fullW0 = NULL, inputcheck = TRUE)
  
  fit_ruv4_hk_sampleB.summary <- ruv_summary(YB,
                                           fit_ruv4_hk_sampleB,
                                           info_samplesB)
  
  currentCor <- cor.test(fit_ruv4_hk_sampleA$betahat,
                        fit_ruv4_hk_sampleB$betahat)$estimate
  
  beta_corAB_HK <- c(beta_corAB_HK, currentCor)
  
}
names(beta_corAB_HK) <- ks
beta_corAB_HK ## K = 23 seems a good choice
```

As k = 23 results in the highest correlation between samples A and B data sets, we consider that as the optimal value, and we run RUV4 using k = 23.

```{r ruv4-k23}

##------ If using EmpCtrl:
K = 23
fit_ruv4_emp_sampleA <- RUV4(YA, X = gA, 
                          ctl = Emp_Ct_Genes_samplesA$empCt_RUV, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_emp_sampleA.summary <- ruv_summary(YA,
                                         fit_ruv4_emp_sampleA,
                                         info_samplesA,
                                         Emp_Ct_Genes_samplesA)

fit_ruv4_emp_sampleB <- RUV4(YB, X = gB, 
                          ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_emp_sampleB.summary <- ruv_summary(YB,
                                         fit_ruv4_emp_sampleB,
                                         info_samplesB,
                                         Emp_Ct_Genes_samplesB)

##------- If using HK genes:
fit_ruv4_hk_sampleA <- RUV4(YA, X = gA, 
                          ctl = ctrl, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_hk_sampleA.summary <- ruv_summary(YA,
                                         fit_ruv4_hk_sampleA,
                                         info_samplesA)

fit_ruv4_hk_sampleB <- RUV4(YB, X = gB, 
                          ctl = ctrl, 
                          k = K,Z = 1, eta = NULL, 
                          fullW0 = NULL, inputcheck = TRUE)

fit_ruv4_hk_sampleB.summary <- ruv_summary(YB,
                                         fit_ruv4_hk_sampleB,
                                         info_samplesB)
```

# Comparison of results of unadjusted, RUVinv- and RUV4-adjusted data
In order to see if we have helped or not, we run differential expression analysis on the **unadjusted data**. Then we compare the results of RUVinv, RUV4 and unadjusted together.

## Unadjusted data
We can run RUV4 with k = 0 to do no adjustment when obtaining DEGs.

### Apply DE analysis in the unadjusted data using HK genes
```{r DE-unadjusted-A-B-using-HK}
# RUV4 with k = 0 for no adjustment
# Equivalent to a Limma Analysis without considering the batch term

##----- In sample A data
fit_unadj_hk_sampleA <- RUV4(YA, X = gA, 
                          ctl = ctrl, 
                          k = 0)
fit_unadj_hk_sampleA.summary <- ruv_summary(YA, 
                                        fit_unadj_hk_sampleA,
                                        info_samplesA)

##----- In sample B data
fit_unadj_hk_sampleB <- RUV4(YB, X = gB, 
                          ctl = ctrl,
                          k = 0)
fit_unadj_hk_sampleB.summary <- ruv_summary(YB, 
                                         fit_unadj_hk_sampleB,
                                         info_samplesB)

```

### Apply DE analysis in the unadjusted data using empirical control genes
```{r DE-unadjusted-A-B-using-EmpricalCtl}
# RUV4 with k = 0 for no adjustment
# Equivalent to a Limma Analysis without considering the batch term

##----- In sample A data
fit_unadj_emp_sampleA <- RUV4(YA, X = gA, 
                          ctl = Emp_Ct_Genes_samplesA$empCt_RUV, 
                          k = 0)
fit_unadj_emp_sampleA.summary <- ruv_summary(YA, 
                                        fit_unadj_emp_sampleA,
                                        info_samplesA, 
                                        Emp_Ct_Genes_samplesA)

##----- In sample B data
fit_unadj_emp_sampleB <- RUV4(YB, X = gB, 
                          ctl = Emp_Ct_Genes_samplesB$empCt_RUV, 
                          k = 0)
fit_unadj_emp_sampleB.summary <- ruv_summary(YB, 
                                         fit_unadj_emp_sampleB,
                                         info_samplesB,
                                         Emp_Ct_Genes_samplesB)

```

## P-values distribution
We compare the results obtained from unajusted, RUVinv- and RUV4- adjusted data using p-value distributions, correlations of beta values and venn diagrams in samples A and B data.

```{r}

ruv_hist(fit_unadj_emp_sampleA.summary) + ggtitle("Unadj_A")
ruv_hist(fit_unadj_emp_sampleB.summary) + ggtitle("Unadj_B")
# ruv_hist(fit_unadj_hk_sampleA.summary)
# ruv_hist(fit_unadj_hk_sampleB.summary)

ruv_hist(fit_ruvin_hk_samplesA.summary) + ggtitle("RUVinv_HK_A")
ruv_hist(fit_ruvin_hk_samplesB.summary) + ggtitle("RUVinv_HK_B")
ruv_hist(fit_ruvin_emp_samplesA.summary) + ggtitle("RUVinv_Emp_A")
ruv_hist(fit_ruvin_emp_samplesB.summary) + ggtitle("RUVinv_Emp_B")

ruv_hist(fit_ruv4_hk_sampleA.summary) + ggtitle("RUV4_HK_A")
ruv_hist(fit_ruv4_hk_sampleB.summary) + ggtitle("RUV4_HK_B")
ruv_hist(fit_ruv4_emp_sampleA.summary) + ggtitle("RUV4_Emp_A")
ruv_hist(fit_ruv4_emp_sampleB.summary) + ggtitle("RUV4_Emp_B")

```

## Betahat correlation
For each of the unadjusted, RUVinv- and RUV4- adjusted settings, we can compare the results between samples A and B data sets to see which method gives more consistent resulst in these two data sets. To quantify these consistencies, we look at the correlations between betahat from sample A and betahat from sample B for the unadjusted method, RUVinv and RUV4.

```{r cor-betahat-A-B-EmpCtl}
##------ Unadjusted data sets
plot(fit_unadj_emp_sampleA$betahat, 
     fit_unadj_emp_sampleB$betahat,
     xlab = "Betahat Samples A",
     ylab = "Betahat Samples B",
     main = "Unadjusted",
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
corVal <- cor.test(fit_unadj_emp_sampleA$betahat, fit_unadj_emp_sampleB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))

##------ RUVinv adjusted data sets
plot(fit_ruvin_emp_samplesA$betahat,
     fit_ruvin_emp_samplesB$betahat,
     xlab = "Betahat Samples A",
     ylab = "Betahat Samples B",
     main = "RUVinv",
     xlim = c(-3,3), cex=0.3, ylim=c(-4,4))
corVal <- cor.test(fit_ruvin_emp_samplesA$betahat, fit_ruvin_emp_samplesB$betahat)$estimate
text(-3,3, pos = 4, paste("Correlation: ", round(corVal,2), sep = ""))

#------- RUV4 adjusted data sets
plot(fit_ruv4_emp_sampleA$betahat,
     fit_ruv4_emp_sampleB$betahat,
     xlab = "Betahat Samples A",
     ylab = "Betahat Samples B",
     main = "RUV4",
     xlim = c(-3,3), cex = 0.3, ylim = c(-4,4))
#abline(fit_ruv4_emp_sampleB$betahat,fit_ruv4_emp_sampleA$betahat)
corVal <- cor.test(fit_ruv4_emp_sampleA$betahat, fit_ruv4_emp_sampleB$betahat)$estimate
text(-3,3, pos=4, paste("Correlation: ", round(corVal,2),sep=""))

```

## Overlap between differentially expressed genes 
Finally, for each of the unadjusted, RUVinv and RUV4, we look at the number of overlapping differentially expressed genes (DEGs) between the two samples A and B data sets. We also check for consistency between methods on the same data (sample A or sample B data set).\

First, we define DEGs as genes with adjusted p-value < 0.05 and |logFC| > 1 in samples A and B data sets which are unadjusted, RUV4- or RUVinv-adjusted.

```{r}
##------ In sample A data
DEGsUnadj_sampleA <- row.names(fit_unadj_emp_sampleA.summary$C)[fit_unadj_emp_sampleA.summary$C$F.p.BH < 0.05 & abs(fit_unadj_emp_sampleA.summary$C$b_X1) > 1]

DEGsRUVinv_sampleA <- row.names(fit_ruvin_emp_samplesA.summary$C)[fit_ruvin_emp_samplesA.summary$C$F.p.BH < 0.05 & abs(fit_ruvin_emp_samplesA.summary$C$b_X1) > 1]

DEGsRUV4_sampleA <- row.names(fit_ruv4_emp_sampleA.summary$C)[fit_ruv4_emp_sampleA.summary$C$F.p.BH < 0.05 & abs(fit_ruv4_emp_sampleA.summary$C$b_X1) > 1]  

##------ In sample B data:
DEGsUnadj_sampleB <- row.names(fit_unadj_emp_sampleB.summary$C)[fit_unadj_emp_sampleB.summary$C$F.p.BH < 0.05 & abs(fit_unadj_emp_sampleB.summary$C$b_X1) > 1]

DEGsRUVinv_sampleB<- row.names(fit_ruvin_emp_samplesB.summary$C)[fit_ruvin_emp_samplesB.summary$C$F.p.BH < 0.05 & abs(fit_ruvin_emp_samplesB.summary$C$b_X1) > 1] 

DEGsRUV4_sampleB <- row.names(fit_ruv4_emp_sampleB.summary$C)[fit_ruv4_emp_sampleB.summary$C$F.p.BH < 0.05 & abs(fit_ruv4_emp_sampleB.summary$C$b_X1) > 1]  
```

### DEGs in the unadjusted data
Venn diagram comparing the unadjusted samples A and B data.
```{r Venn-DEGs-unadj}
allDEGs_unadj <- c(DEGsUnadj_sampleA, 
                    DEGsUnadj_sampleB)

## remove duplicated gene symbols:
allDEGs_unadj <- allDEGs_unadj[!duplicated(allDEGs_unadj)]  

## Draw a Venn diagram comparing DEGs for RUVinv
Counts_unadj <- matrix(0, nrow = length(allDEGs_unadj), ncol = 2)
row.names(Counts_unadj)<- allDEGs_unadj
colnames(Counts_unadj)<- c("Unadj_A","Unadj_B")

for( i in 1:length(allDEGs_unadj)) {
  Counts_unadj[i,1]<- allDEGs_unadj[i] %in% DEGsUnadj_sampleA
  Counts_unadj[i,2]<- allDEGs_unadj[i] %in% DEGsUnadj_sampleB
}

col <- c("blue", "violet")
vennDiagram(vennCounts(Counts_unadj), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd=2)

```

### DEGs in the RUVinv-adjusted data
Venn diagram comparing the RUVinv-adjusted samples A and B data.
```{r Venn-DEGs-RUVinv}

allDEGs_RUVinv <- c(DEGsRUVinv_sampleA, 
                    DEGsRUVinv_sampleB)

## remove duplicated gene symbols:
allDEGs_RUVinv <- allDEGs_RUVinv[!duplicated(allDEGs_RUVinv)]  

## Draw a Venn diagram comparing DEGs for RUVinv
Counts_RUVinv <- matrix(0, nrow = length(allDEGs_RUVinv), ncol = 2)
row.names(Counts_RUVinv)<- allDEGs_RUVinv
colnames(Counts_RUVinv)<- c("RUVinv_A","RUVinv_B")

for( i in 1:length(allDEGs_RUVinv)) {
  Counts_RUVinv[i,1]<- allDEGs_RUVinv[i] %in% DEGsRUVinv_sampleA
  Counts_RUVinv[i,2]<- allDEGs_RUVinv[i] %in% DEGsRUVinv_sampleB
}

col<- c("blue", "violet")
vennDiagram(vennCounts(Counts_RUVinv), 
            circle.col = col,
            cex = c(1.6, 1.2, 1), lwd = 2)

```

### DEGs in the RUV4-adjusted data
Venn diagram comparing the RUV4-adjusted samples A and B data.
```{r Venn-DEGs-RUV4}

allDEGs_RUV4<- c(DEGsRUV4_sampleA, DEGsRUV4_sampleB)

## remove duplicated gene symbols:
allDEGs_RUV4<- allDEGs_RUV4[!duplicated(allDEGs_RUV4)]  

## Draw a Venn diagram comparing DEGs for RUV4
Counts_RUV4 <- matrix(0, nrow= length(allDEGs_RUV4), ncol=2)
row.names(Counts_RUV4)<- allDEGs_RUV4
colnames(Counts_RUV4)<- c("RUV4_A","RUV4_B")

for( i in 1:length(allDEGs_RUV4)) {
  Counts_RUV4[i,1]<- allDEGs_RUV4[i] %in% DEGsRUV4_sampleA
  Counts_RUV4[i,2]<- allDEGs_RUV4[i] %in% DEGsRUV4_sampleB
}

col<- c("blue", "violet")
vennDiagram(vennCounts(Counts_RUV4), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)
```

### Compare the unadjusted, RUV4 and RUVinv-adjusted in samples A data
```{r Venn-DEGs-samplesA}

allDEGs_sampleA <- c(DEGsUnadj_sampleA, DEGsRUVinv_sampleA, DEGsRUV4_sampleA)

## remove duplicated gene symbols:
allDEGs_sampleA <- allDEGs_sampleA[!duplicated(allDEGs_sampleA)]  

## Draw a Venn diagram comparing DEGs for sample A
Counts_sampleA <- matrix(0, nrow= length(allDEGs_sampleA), ncol=3)
row.names(Counts_sampleA)<- allDEGs_sampleA
colnames(Counts_sampleA)<- c("Unadj_A", "RUVinv_A", "Ruv4_A")

for( i in 1:length(allDEGs_sampleA)) {
  Counts_sampleA[i,1]<- allDEGs_sampleA[i] %in% DEGsUnadj_sampleA
  Counts_sampleA[i,2]<- allDEGs_sampleA[i] %in% DEGsRUVinv_sampleA
  Counts_sampleA[i,3]<- allDEGs_sampleA[i] %in% DEGsRUV4_sampleA
}

col<- c("blue", "violet", "darkgreen")
vennDiagram(vennCounts(Counts_sampleA), 
            circle.col=col, 
            cex=c(1.6, 1.2, 1), lwd=2)

```

### Compare the unadjusted, RUV4 and RUVinv-adjusted in samples B data
```{r Venn-DEGs-samplesB}

allDEGs_sampleB <- c(DEGsUnadj_sampleB,
                     DEGsRUVinv_sampleB,
                     DEGsRUV4_sampleB)

## remove duplicated gene symbols:
allDEGs_sampleB <- allDEGs_sampleB[!duplicated(allDEGs_sampleB)] 

## Draw a Venn diagram comparing DEGs for sample B
Counts_sampleB <- matrix(0, nrow = length(allDEGs_sampleB), ncol = 3)
row.names(Counts_sampleB) <- allDEGs_sampleB
colnames(Counts_sampleB) <- c("Unadj_B", "RUVinv_B","Ruv4_B")

for( i in 1:length(allDEGs_sampleB)) {
  Counts_sampleB[i,1]<- allDEGs_sampleB[i] %in% DEGsUnadj_sampleB
  Counts_sampleB[i,2]<- allDEGs_sampleB[i] %in% DEGsRUVinv_sampleB
  Counts_sampleB[i,3]<- allDEGs_sampleB[i] %in% DEGsRUV4_sampleB
}

col <- c("blue", "violet", "darkgreen")
vennDiagram(vennCounts(Counts_sampleB), 
            circle.col = col, 
            cex = c(1.6, 1.2, 1), lwd = 2)
```







